The tl;dr version: in a given box of Lucky Charms, for each bowl eaten we estimate a decrease of approximately 2.7 total charms per bowl on average. The weight of cereal also appears to be significant and for an increase of 1g of cereal we estimate an increase of approximately 0.5 charms on average—with bowl held constant. The interaction between bowl and weight is not statistically significant.

See this GitHub repository for all of the data, code, photos, etc.

Background and introduction

Figure 1. They’re Magically Delicious!
Figure 1. They’re Magically Delicious!

Background

In the early 2010’s there were a series of articles that circulated on the Internet about somebody investigating whether or not “Double Stuf” Oreos were truly double-stuffed. (They’re not.) It was a neat idea, and a substantial amount of material has been written about it in the years since, see here to get started. It evidently caused enough splash that some teachers were said to be using the experiment as an activity in their classrooms, and students locally have reported performing the experiment at their own school.

Introduction

Against this backdrop, in the summer of 2023 I was eating a bowl of Lucky Charms for breakfast one morning. The box was nearly empty, and I thought to myself, “I’d almost rather throw the rest of this box away and open up a brand new box of Lucky Charms!” Now, if you’re anything like me, or millions of other people, then you love Lucky Charms, and you’ve loved them for as long as you can remember. They truly are Magically Delicious! But sitting there that summer morning with a spoon in my hand it occurred to me that my latter bowls of the box just didn’t seem quite as magical as the first bowls had seemed to be. They had lost something. (The charms, of course.) But could it be my imagination? Was this effect real? And if so, was it detectable?

I happened to be teaching an upper-division undergraduate Probability & Statistics class at the time and the 4 students and I were determined to find out.

Experiment and data

After some discussion, we decided on our materials and methods.

Materials

  • Six (6) boxes of Family-size Lucky Charms (18.6oz, 527g)
  • Electronic kitchen scale
  • Two plastic “bowls”, Container A and Container B, measuring 40.125g and 28.375g respectively
  • Large bowl for discards, some trash bags, and other assorted ancillaries

The Lucky Charms were purchased from our local retailer—Wal-Mart. There was nothing special about \(n = 6\) boxes, it was simply the number of boxes a person could carry with two hands to the sixth floor of Cafaro Hall in a single trip. The kitchen scale was for measuring the weight of cereal, which the team thought might be important, and the scale would also help with data collection because we didn’t want to be overly preoccupied with sampling the exact same quantity of cereal every time.

Figure 2. Weighing in.
Figure 2. Weighing in.

Methods

For the purposes of this experiment, a “bowl” was taken to be approximately 1 serving as recommended on the box (1 cup or 36g), even though it is ridiculous for anybody but a tiny magical leprechaun to get by on a meager 36g of Lucky Charms for breakfast. The team was not particularly careful with the bowl size, and anything close to 1 cup was considered good enough. We were accounting for mass of cereal with the kitchen scale anyway, and were shooting for a healthy range of observed weights.

Figure 3. Pictured: Gavin Duwe (l) and Brenna Brocker (r)
Figure 3. Pictured: Gavin Duwe (l) and Brenna Brocker (r)

A bowl was scooped directly out of the box, weighed, and then poured out on a table surface for counting. The toasted oats were separated from the marshmallows and discarded. Then the following eight (8) charm types were recognized and their number recorded: Rainbows, Pink Hearts, Purple Horseshoes, Blue Moons, Green Clovers, Unicorns, Tasty Red Balloons and Orange Stars.

Figure 4. This was counted as 3 Purple Horseshoes.
Figure 4. This was counted as 3 Purple Horseshoes.

There were typically little bits and pieces of charms mixed about in the bowl; not every charm is 100% whole. To deal with this, the team attempted to classify the bit as to the type of charm (Green Clover, Blue Moon, etc.), and if the type could be determined, then that bit was counted as 1 in the respective category. If the bit was too small/nondescript for type identification then it was discarded.

The data

Data were collected over two separate class meetings—we had other statistical ground to cover, after all. Two pairs of students collected and counted at once, and I helped with the scale and recording a hard copy of the weight values as they were called out and entered into the computer.

Figure 5. Data doesn’t get much rawer than this.
Figure 5. Data doesn’t get much rawer than this.

The plastic container + cereal were weighed together each round, and then the weight of the container (measured at the start of the experiment) was subtracted from the observed total weight. The charms were entered into their respective columns and totaled.

Figure 6. Pictured: Kate Coppola.
Figure 6. Pictured: Kate Coppola.

Measured variables

  • Box: the box number (1 through 6)
  • Bowl: the sequential bowl for each box (ranges from 1 to around 12)
  • Observation: the observed order of bowls across boxes (1 to 69)
  • Totweight: weight of each plastic container + cereal, in grams
  • Weight: of cereal in grams, after subtracting the weight of the container
  • Hearts, Stars, etc: how many of that charm in that bowl
  • Totcharms: sum total of the assorted charms
Figure 7. Pictured: Haziq Rabbani
Figure 7. Pictured: Haziq Rabbani

Here is a look at the top of the data set (first 6 rows):

knitr::kable(head(Lucky), "simple")
Box Bowl Observation Totweight Weight Hearts Stars Horseshoes Clovers Moons Unicorns Balloons Rainbows Totcharms
1 1 1 60.52 32.145 5 5 14 3 4 2 1 4 38
1 2 2 69.92 41.545 7 1 7 8 6 1 5 4 39
1 3 3 66.18 37.805 5 4 4 4 6 6 4 6 39
1 4 4 63.82 35.445 5 6 8 1 4 7 4 2 37
1 5 5 75.54 47.165 4 1 5 4 7 2 2 6 31
1 6 6 77.28 48.905 6 4 4 4 8 3 1 6 36

The average Weight was approximately 46.3g, the naximum number of a particular charm in any one bowl was 15 (Pink Hearts tied with Purple Horseshoes), and there could be many, many other statistics and tables of interest in this data set to explore. For the task at hand, though, we are primarily concerned with Totcharms, and how that response variable is related to Bowl, and maybe Weight to a lesser extent.

Here is a graph of Totcharms by Bowl, colored by Box.

Lucky |> ggplot(aes(x = Bowl, y = Totcharms, color = Box)) + 
  geom_point(size = 3) +
  labs(y = '# Charms') -> p1
p1

Here we see a clear decreasing trend in Totcharms as Bowl increases, and the pattern is surprisingly linear. There may be a slight curvature. The colors are difficult to pick out, so let’s make a line plot and highlight a couple of series.

sizes <- c(2, 1, 2, 1, 1, 1)
alphas <- c(1, 0.2, 1, 0.2, 0.2, 0.2)
Lucky |> ggplot(aes(x = Bowl, y = Totcharms)) +
  geom_line(aes(colour = Box, linewidth = Box, alpha = Box)) +
  scale_discrete_manual("linewidth", values = sizes) +
  scale_alpha_manual(values = alphas, guide = "none")

We can see a general trend downward for each series, but the path to get there varies for every box. Notice how Box 3 starts high and stays high for a few bowls before dropping off smoothly until after Bowl 10 when it crashes. Look how Box 1 starts relatively low, but then increases after Bowl 5, peaks at Bowl 8, then nosedives down to Bowl 12. It’s as if the charms were settled near the top of Box 3, but were more concentrated in a pocket near the middle of Box 1. Some boxes bounce around a great deal, while other boxes are more of a straight line downward. Put it all together, though, and the overall trend is decreasing and linear. Note that all boxes lasted to Bowl 11, but only 2 boxes had 12 bowls, and a single box (Box 4) made it to Bowl 13.

Now let’s take a look at Totcharms versus Weight.

Lucky |> ggplot(aes(x = Weight, y = Totcharms, color = Box)) + 
  geom_point(size = 3) +
  labs(x = 'Weight (g)', y = '# Charms') -> p2
p2

This is a much noisier plot—as expected. We have a nice range of weights, from a minimum under 30g up to a maximum near 70g. Note that the lowest weights are usually associated with the last bowl of the box: there isn’t enough cereal remaining to make a full 36g bowl. And notice the one bowl which was extraordinarily heavy. It isn’t obvious what the explanation for that is, but if we dig a little deeper and plot Weight versus Bowl we can get a hint:

Lucky |> ggplot(aes(x = Bowl, y = Weight, color = Box)) + 
  geom_point(size = 3) + ylim(5, 75) +
  labs(y = 'Weight (g)') -> p3
p3

Now we see that the extra-heavy bowl was the last Bowl = 12 of Box = 1. The reason behind that particular data point has unfortunately been lost to the sands of time, but since it was the first box the team had ever finished, near the end it may have been difficult to judge how much was left, and perhaps the rest was poured into that last bowl—I personally do the same thing all the time at breakfast when I am nearing the end of a box of cereal. Looking at the points on the graph, if that 12th bowl of almost 70g had been split into (say) two bowls of 40g and 30g, respectively, then we would have had two boxes that lasted to 13 bowls instead of one, and maybe the models below would have fit slightly better. Alas! We will never know. Such is the scientific enterprise.

While the relationship between Totcharms and Weight does not appear to be strongly linear, there is a hidden relationship between Totcharms, Bowl, and Weight, which is best visualized with a 3D plot.

library(plotly)
Lucky$Box <- as.factor(Lucky$Box)
fig <- plot_ly(Lucky, x = ~Bowl, y = ~Weight, z = ~Totcharms, color = ~Box)
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Bowl'),
                                   yaxis = list(title = 'Weight (g)'),
                                   zaxis = list(title = '# Charms')),
                      legend=list(title=list(text='Box')))
fig

Note that the 3D plot is interactive. Go ahead, spin it around, zoom, pan—check it out. If you spin it around just right you will see that the dots lie more-or-less on a flat plane in 3D-space. This is exactly the kind of relationship we are looking for in a multiple linear regression model (we’ll get to that in a minute).

Model fitting

Now let’s try to quantify the linear relationship between these variables. We will start with a simple linear regression model relating Totcharms to Bowl.

For Bowl

Here is the model:

mod1 <- lm(Totcharms ~ Bowl, data = Lucky)
summary(mod1)
## 
## Call:
## lm(formula = Totcharms ~ Bowl, data = Lucky)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7629  -5.7629  -0.4327   6.2277  22.2277 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  55.1309     2.1237  25.960  < 2e-16 ***
## Bowl         -2.6698     0.2985  -8.945 4.81e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.313 on 67 degrees of freedom
## Multiple R-squared:  0.5442, Adjusted R-squared:  0.5374 
## F-statistic: 80.01 on 1 and 67 DF,  p-value: 4.807e-13

We see that Bowl is a highly significant predictor for Totcharms. The slope is approximately \(-2.7\), in other words, for each additional bowl of Lucky Charms eaten we estimate the average Totcharms to decrease by 2.7 charms. Our coefficient of determination is \(R^2 = 0.5442\), that is, approximately 54% of the variance in Totcharms is explained by the regression model with Bowl as a predictor. There should be a whole discussion included here about residual analysis which we are going to skip, but suffice it to say that the residual plots are relatively well-behaved. Let’s check out a fitted line plot with confidence bands for the regression line (the default):

p1 + geom_smooth(method = "lm", aes(group=1), colour="black")

That’s a nice relationship with a clear decreasing trend.

For Weight

We will do the same thing for Weight, ignoring Bowl for the moment. Here we go:

mod2 <- lm(Totcharms ~ Weight, data = Lucky)
summary(mod2)
## 
## Call:
## lm(formula = Totcharms ~ Weight, data = Lucky)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.0151  -8.7745   0.6901   7.8328  24.4701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  22.1370    10.5650   2.095   0.0399 *
## Weight        0.3502     0.2256   1.552   0.1254  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.1 on 67 degrees of freedom
## Multiple R-squared:  0.0347, Adjusted R-squared:  0.02029 
## F-statistic: 2.409 on 1 and 67 DF,  p-value: 0.1254

Here too, Weight is a highly significant predictor of Totcharms. The slope is approximately \(0.35\), in other words, for each additional 1g of Lucky Charms in a bowl we estimate the average Totcharms to increase by 0.35 charms. This makes sense: more cereal, more charms. The coefficient of determination is pretty bad: \(R^2 = 0.0347\), that is, approximately NONE% of the variance in Totcharms is explained by the regression model with Weight as a predictor. (!!) That’s okay, Weight was more of a supplementary measure to help control for variability in the scooped cereal amounts. The residual analysis isn’t as pretty here, which makes sense, and we would expect some problems anyway with the extreme values on the high and low ends. For the sake of completeness we will include another fitted line plot:

p2 + geom_smooth(method = "lm", aes(group=1), colour="black")

I had thought to use the ggpubr package to put these fitted-line plots together and try to save some space in the discussion, but the plots were rather cramped and not very informative. Anyway, this is what I would have done:

# library(ggpubr)
# ggarrange(p1 + geom_smooth(method = "lm", aes(group=1), colour="black"),
#           p2 + geom_smooth(method = "lm", aes(group=1), colour="black"),
#           align = 'h', labels=c('A', 'B'), legend = "right",
#           common.legend = TRUE)

Multiple regression

Now we get to the fun part: we’ve explored the relationship between Totcharms and Bowl and Totcharms ~ Weight individually, but what if we put them together? Let’s find out:

mod3 <- lm(Totcharms ~ Bowl + Weight, data = Lucky)
summary(mod3)
## 
## Call:
## lm(formula = Totcharms ~ Bowl + Weight, data = Lucky)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8825  -5.4425  -0.9975   5.2475  26.5304 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.3168     6.8655   4.853 7.78e-06 ***
## Bowl         -2.7552     0.2796  -9.855 1.35e-14 ***
## Weight        0.4819     0.1452   3.318  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.754 on 66 degrees of freedom
## Multiple R-squared:  0.6094, Adjusted R-squared:  0.5976 
## F-statistic: 51.49 on 2 and 66 DF,  p-value: 3.363e-14

Check it out! Both Bowl and Weight are still highly significant predictors of Totcharms. The slopes are nearly identical, but the slope for Weight increased to nearly 0.5 charms for each additional 1g of cereal. The slope for Bowl is still \(-2.7\). Our (adjusted) Multiple \(R^2\) has increased to almost 60%—this is notable considering the small dataset, the overall noise level, and some questionable design choices for the experiment (every marshmallow bit counts as 1, etc.). It’s kind of a surprise that the data weren’t more mischievous and belligerent. If only more datasets we work with on a daily basis were so well-behaved.

Add the regression plane

We’ve got to do this one. It is a really cool plot to see and play with, but the code is more involved than the other examples above. Never fear, you can check it all out in this GitHub Gist here. Let’s get on with the plot:

Interactive 3D-plots are a lot of fun. With plotly and RStudio and GitHub so widely available these days, sharing such things with other people is easier than ever before. I hope you enjoy playing with the above as much as I have.

As a final remark, in the tl;dr paragraph at the beginning we claimed that the interaction between Bowl and Weight is not significant. It isn’t. We leave that for the reader to verify independently via something like the following:

summary(lm(Totcharms ~ Bowl * Weight, data = Lucky))

Discussion and questions

I originally thought that either the whole thing would turn out to be just my imagination, or that the effect would be too small to detect without LOTS AND LOTS of boxes of Lucky Charms. Both turned out to be false. The effect is real, and it is big enough to detect with a handful of boxes.

The standard explanation on physical grounds would probably consider a box of Lucky Charms as a simple mixture of cereal and marshmallows, with the notion that over time, due to many factors affecting the boxes such as jostling during transport, placement on store shelves, transit to the home, and activity on the shelf, we should expect a certain amount of settling in the contents of the box, with the less-dense marshmallows tending toward the top, and the slightly more-dense toasted oats tending toward the bottom. This is logical, anyway. But there are several open questions

Next steps

Since the original experiment in Summer 2023, I’ve run the experiment a couple more times with different groups of students. The first was in November 2023 with middle schoolers as part of YSU MegaMath, and the second was in February 2024 with high schoolers as part of YSU MathFest. I didn’t give the MegaMath students very clear instructions and before I knew it all teams had removed the plastic bags of cereal from the boxes and were scooping from the middle of the bag. I don’t blame them; it is a much easier method of scooping cereal, certainly, but it mostly destroys the natural density-sorting of the marshmallows, the key underlying factor we expect is at play. By the time MathFest came along I was better organized and wrote a data collection sheet (which you can find here) with more detailed guidance and instructions. Those data are stored in Github in the extraData directory.

Moving forward, I would like to collect more data and get ever-better estimates of the Lucky Charm dropoff, and search for procedures that effectively mix the cereal to better distribute the marshmallows throughout the duration of the Lucky Charms box’s lifetime. As a result maybe the first bowl of the box won’t be quite so magical, but on the other hand, maybe those final bowls won’t feel like such a chore, waiting for the next brand new box of Lucky Charms!

Acknowledgements

This experiment and these results would not have been possible without the enthusiasm and diligent attention to detail of all four STAT 3743 students of Summer 2023: Brenna Brocker, Kate Coppola, Gavin Duwe, and Haziq Rabbani. I would also like to thank the Department of Mathematics and Statistics at Youngstown State University for supporting this research and supporting additional data collection at YSU MegaMath and YSU MathFest.

References and code examples